ENVX1002 Introduction to Statistical Methods
The University of Sydney
May 2025
Consider measuring daily rainfall amounts in a coastal wetland. Some days have no rain. Most days have small to moderate amounts. Few days have very heavy rainfall.
This creates a pattern of variation in rainfall amounts, which at some point we can predict. This pattern is what we call a probability distribution (i.e. we can calculate the probability of a certain amount of rainfall on a given day).
A probability distribution is a function or equation that tells us:
It is like a mathematical recipe for how values are distributed.
Depending on the data, distributions can be:
We now combine all of what we have learnt about data in the last 3 weeks:
But how do we use these properties to make predictions? Do some distributions have special properties?
# Annotate our rainfall distribution with key properties
p_rainfall +
geom_vline(
xintercept = mean(rainfall_data$rainfall),
colour = "red", linetype = "dashed"
) +
annotate("text",
x = mean(rainfall_data$rainfall), y = 100,
label = "Mean", angle = 90, vjust = -0.5
) +
labs(
x = "Daily Rainfall (mm)",
y = "Frequency",
title = "Can we use the mean to predict rainfall?"
)Tree heights in a forest often follow a normal distribution:
# Set up the data
set.seed(123)
tree_heights <- data.frame(
height = rnorm(1000, mean = 20, sd = 3)
)
# Create plots for comparison
p1 <- ggplot(rainfall_data, aes(x = rainfall)) +
geom_histogram(aes(y = ..density..),
binwidth = 2,
fill = "steelblue",
colour = "white"
) +
labs(
x = "Daily Rainfall (mm)",
y = "Density",
title = "Skewed Distribution\n(Rainfall)"
)
p2 <- ggplot(tree_heights, aes(x = height)) +
geom_histogram(aes(y = ..density..),
binwidth = 0.5,
fill = "forestgreen",
colour = "white"
) +
labs(
x = "Tree Height (m)",
y = "Density",
title = "Normal Distribution\n(Tree Heights)"
)
# Arrange plots side by side
plot_grid(p1, p2, ncol = 2)The normal distribution is a natural phenomenon in many real-world situations – a fundamental concept in statistics.
We can use these properties to make reasonable predictions about tree heights:
# Calculate mean and standard deviations
mean_height <- mean(tree_heights$height)
sd_height <- sd(tree_heights$height)
# Create plot with standard deviation ranges
p2 +
# Add vertical lines for mean and SDs
geom_vline(
xintercept = mean_height,
colour = "darkred",
size = 1,
linetype = "dashed"
) +
# Add SD ranges
annotate("rect",
xmin = mean_height - sd_height,
xmax = mean_height + sd_height,
ymin = 0, ymax = Inf,
fill = "yellow", alpha = 0.2
) +
annotate("rect",
xmin = mean_height - 2 * sd_height,
xmax = mean_height + 2 * sd_height,
ymin = 0, ymax = Inf,
fill = "orange", alpha = 0.1
) +
# Add labels
annotate("text",
x = mean_height, y = 0.15,
label = "Mean", angle = 90, vjust = -0.5
) +
annotate("text",
x = mean_height - sd_height/2, y = 0.12,
label = "68% within\n1 SD", angle = 90, vjust = -0.5
) +
annotate("text",
x = mean_height - sd_height, y = 0.09,
label = "95% within\n2 SD", angle = 90, vjust = -0.5
) +
labs(
x = "Tree Height (m)",
y = "Density",
title = "Normal Distribution\n(Tree Heights)"
)A data that follows a normal distribution:
This pattern is the same for ALL normal distributions – a universal rule.
Importantly, it is a predictable pattern (although not exact).
# Calculate statistics for both distributions
tree_mean <- mean(tree_heights$height)
tree_sd <- sd(tree_heights$height)
rain_mean <- mean(rainfall_data$rainfall)
rain_sd <- sd(rainfall_data$rainfall)
# Create comparison plots showing fitted normal curves
p1 <- ggplot(tree_heights, aes(x = height)) +
geom_histogram(aes(y = ..density..),
binwidth = 0.5,
fill = "forestgreen",
colour = "white",
alpha = 0.7
) +
# Add fitted normal curve
stat_function(
fun = dnorm,
args = list(mean = tree_mean, sd = tree_sd),
colour = "darkred",
size = 1
) +
# Add mean line and 1 SD range
geom_vline(xintercept = tree_mean, colour = "blue", linetype = "dashed") +
annotate("rect",
xmin = tree_mean - tree_sd,
xmax = tree_mean + tree_sd,
ymin = 0, ymax = Inf,
fill = "yellow", alpha = 0.2
) +
labs(
x = "Tree Height (m)",
y = "Density",
title = "Normal Data:\nCan Fit Normal Distribution"
)
p2 <- ggplot(rainfall_data, aes(x = rainfall)) +
geom_histogram(aes(y = ..density..),
binwidth = 2,
fill = "steelblue",
colour = "white",
alpha = 0.7
) +
# Try to fit normal curve (shows poor fit)
stat_function(
fun = dnorm,
args = list(mean = rain_mean, sd = rain_sd),
colour = "darkred",
size = 1
) +
geom_vline(xintercept = rain_mean, colour = "blue", linetype = "dashed") +
annotate("rect",
xmin = rain_mean - rain_sd,
xmax = rain_mean + rain_sd,
ymin = 0, ymax = Inf,
fill = "yellow", alpha = 0.2
) +
labs(
x = "Rainfall (mm)",
y = "Density",
title = "Skewed Data:\nPoor Normal Fit"
)
plot_grid(p1, p2, ncol = 2)R provides three main functions that help us work with normal distributions. Think of them as different ways to ask questions about our data:
pnorm(): “What’s the probability?”R provides three main functions that help us work with normal distributions. Think of them as different ways to ask questions about our data:
qnorm(): “What’s the value?”R provides three main functions that help us work with normal distributions. Think of them as different ways to ask questions about our data:
dnorm(): “What’s the relative likelihood?”pnorm() for probabilitiesWhat proportion of trees in our forest are shorter than 23m, knowing that mean height is 20m and the sd is 3m?
prob_under_23 <- pnorm(q = 23, mean = 20, sd = 3) # Calculate proportion of trees under 23m
prob_under_23[1] 0.8413447
# Visualise the probability
percent_under_23 <- prob_under_23 * 100
ggplot(data.frame(x = c(10, 30)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 20, sd = 3)) +
geom_area(
stat = "function", fun = dnorm, args = list(mean = 20, sd = 3),
fill = "lightblue", alpha = 0.5, xlim = c(10, 23)
) +
geom_vline(xintercept = 23, linetype = "dashed", colour = "red") +
annotate("text",
x = 23, y = 0.02,
label = sprintf("%.1f%% of trees", percent_under_23),
hjust = -0.1
) +
labs(
x = "Tree Height (m)", y = "Density",
title = "Tree Heights Below 23m"
)qnorm() for thresholdsHow tall are the largest 10% of trees in our forest?
# Find the height threshold for tallest 10%
height_90th <- qnorm(p = 0.90, mean = 20, sd = 3)
height_90th[1] 23.84465
# Visualise the threshold
ggplot(data.frame(x = c(10, 30)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 20, sd = 3)) +
geom_area(stat = "function", fun = dnorm, args = list(mean = 20, sd = 3),
fill = "lightgreen", alpha = 0.5, xlim = c(height_90th, 30)) +
geom_vline(xintercept = height_90th, linetype = "dashed", colour = "red") +
annotate("text", x = height_90th, y = 0.02,
label = sprintf("Trees above %.1f m\nare in tallest 10%%", height_90th),
hjust = 1.1) +
labs(x = "Tree Height (m)", y = "Density",
title = "Identifying Tallest Trees")dnorm() for relative likelihoodWhich tree height is most common in our forest: 17m, 20m, or 23m? (Or, how likely are these heights?)
# Find relative likelihood at different heights
heights <- c(17, 20, 23) # Example heights to compare
densities <- dnorm(heights, mean = 20, sd = 3)
densities[1] 0.08065691 0.13298076 0.08065691
# Plot relative likelihoods
ggplot(data.frame(x = c(10, 30)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 20, sd = 3)) +
geom_point(data = data.frame(x = heights, y = densities),
aes(y = y), colour = "red", size = 3) +
geom_segment(data = data.frame(x = heights),
aes(x = x, xend = x, y = 0, yend = dnorm(x, 20, 3)),
linetype = "dashed", colour = "blue") +
labs(x = "Tree Height (m)", y = "Relative Likelihood",
title = "Height Probabilities in the Forest")Trees closest to 20m are most common, with likelihood decreasing as we move away.
These functions help us answer questions like:
# Probability between 18m and 22m - calculate the difference
prob_between <- pnorm(22, 20, 3) - pnorm(18, 20, 3)
prob_between[1] 0.4950149
We can convert any normal distribution to a standard form where:
X \sim N(\mu, \sigma^2) \rightarrow Z \sim N(0, 1)
Standardising helps us:
Formula: z = \frac{x - \mu}{\sigma} (subtract mean, divide by SD)
# Create data for two tree species
set.seed(123)
species1 <- rnorm(1000, mean = 20, sd = 3) # Tall species
species2 <- rnorm(1000, mean = 15, sd = 2) # Shorter species
# Plot original heights
data.frame(
height = c(species1, species2),
species = rep(c("Species 1", "Species 2"), each = 1000)
) |>
ggplot(aes(x = height, fill = species)) +
geom_density(alpha = 0.5) +
labs(x = "Tree Height (m)", y = "Density",
title = "Original Height Distributions")# Convert to z-scores
z1 <- scale(species1)
z2 <- scale(species2)
# Plot standardised distributions
data.frame(
zscore = c(z1, z2),
species = rep(c("Species 1", "Species 2"), each = 1000)
) |>
ggplot(aes(x = zscore, fill = species)) +
geom_density(alpha = 0.5) +
labs(x = "Z-score", y = "Density",
title = "Standardised Distributions")A z-score of 2 means:
Once we standardise our data to z-scores, we can calculate probabilities without knowing the original mean or standard deviation.
# Create a visual of the standard normal curve
ggplot(data.frame(x = c(-3.5, 3.5)), aes(x = x)) +
stat_function(fun = dnorm, size = 1, colour = "black") +
geom_vline(xintercept = 0, linetype = "dashed", colour = "blue") +
annotate("text", x = 0.3, y = 0.1, label = "Mean = 0") +
labs(x = "Z-score", y = "Density",
title = "Standard Normal Distribution")pnorm(): What proportion of values fall below a given z-score?qnorm(): What z-score corresponds to a specific percentile?dnorm(): How likely is a specific z-score to occur?Let’s see how these help us understand our tree height data:
# Example: For a tree with z-score = 1.5 (taller than average)
# What percentage of trees are shorter than this tree?
pnorm(1.5)[1] 0.9331928
# Example: How tall must a tree be to be in the tallest 5%?
# (Find the z-score at the 95th percentile)
qnorm(0.95)[1] 1.644854
# Example: Compare likelihood of average height vs. very tall trees
dnorm(0)/dnorm(2) # How much more common is z=0 compared to z=2[1] 7.389056
The blue shaded area shows pnorm(1) = 0.84, meaning 84% of trees have a z-score less than 1.
# Create a visual showing pnorm in action
ggplot(data.frame(x = c(-3.5, 3.5)), aes(x = x)) +
stat_function(fun = dnorm, size = 1, colour = "black") +
geom_area(stat = "function", fun = dnorm,
fill = "lightblue", alpha = 0.5, xlim = c(-3.5, 1)) +
geom_vline(xintercept = 1, linetype = "dashed", colour = "red") +
annotate("text", x = -1, y = 0.3,
label = "pnorm(1) = 0.84\n84% of values\nbelow z=1") +
labs(x = "Z-score", y = "Density",
title = "Using pnorm() with the Standard Normal Curve")Different data types need different distributions:
# Create sequence for x-axis
x <- seq(-4, 4, length = 100)
# Create data frame for plotting
distributions <- data.frame(
x = rep(x, 3),
density = c(
dt(x, df = 5), # t-distribution
dchisq(x + 4, df = 3),# chi-square (shifted)
dnorm(x) # normal for comparison
),
Distribution = rep(c("t", "Chi-square", "Normal"), each = 100)
)
# Plot distributions
ggplot(distributions, aes(x = x, y = density, colour = Distribution)) +
geom_line(size = 1) +
labs(x = "Value", y = "Density",
title = "Different Types of Data, Different Distributions") +
theme(legend.position = "bottom")We’ll learn when to use each one as we need them.
A sample is a subset of a population used to represent the entire group. It allows us to make inferences about the population without examining every member.
A sampling distribution tells us:
Think of it as distribution of sample statistics, not individual measurements. Your variable and all its values are a single number in this distribution.
Let’s take samples from our skewed rainfall data and see what happens:
[1] 6.49786
If we take another sample, we get a different mean:
[1] 6.656836
Each sample gives us a slightly different estimate of the true mean rainfall.
What happens if we take many samples and look at all their means?
# Take 1000 samples of size 5 and calculate their means
set.seed(456)
rain_means_n5 <- replicate(1000, mean(sample(rainfall_population, size = 5)))
# Plot the distribution of sample means
ggplot(data.frame(sample_mean = rain_means_n5), aes(x = sample_mean)) +
geom_histogram(binwidth = 0.5, fill = "skyblue", colour = "white") +
geom_vline(xintercept = mean(rainfall_population),
colour = "red", linetype = "dashed") +
annotate("text", x = mean(rainfall_population) + 1, y = 80,
label = "Population mean", colour = "red") +
labs(x = "Sample Mean Rainfall (mm)", y = "Frequency",
title = "Distribution of Sample Means (n = 5)")Notice how the sample means are distributed around the true population mean, but the distribution is still somewhat skewed.
What happens when we increase our sample size?
# Take 1000 samples of different sizes
set.seed(404)
rain_means_n1 <- replicate(1000, mean(sample(rainfall_population, size = 1)))
# We already have n5 from previous slide
rain_means_n15 <- replicate(1000, mean(sample(rainfall_population, size = 15)))
rain_means_n20 <- replicate(1000, mean(sample(rainfall_population, size = 20)))
rain_means_n30 <- replicate(1000, mean(sample(rainfall_population, size = 30)))
rain_means_n100 <- replicate(1000, mean(sample(rainfall_population, size = 100)))
# Combine data for comparison with more sample sizes
rain_means_df <- data.frame(
sample_mean = c(rain_means_n1, rain_means_n5, rain_means_n15,
rain_means_n20, rain_means_n30, rain_means_n100),
sample_size = factor(rep(c(1, 5, 15, 20, 30, 100), each = 1000),
levels = c(1, 5, 15, 20, 30, 100))
)
# Create faceted plot
ggplot(rain_means_df, aes(x = sample_mean)) +
geom_density(fill = "skyblue", alpha = 0.7) +
geom_vline(xintercept = mean(rainfall_population),
colour = "red", linetype = "dashed") +
facet_wrap(~ sample_size, ncol = 3, scales = "free_y",
labeller = labeller(sample_size = function(x) paste("n =", x))) +
labs(x = "Sample Mean Rainfall (mm)", y = "Density",
title = "Effect of Sample Size on Distribution of Sample Means") +
theme(strip.background = element_rect(fill = "lightgrey"),
strip.text = element_text(size = 12, face = "bold"))As sample size increases, something remarkable happens:
What happens when we increase our sample size?
ggplot(rain_means_df, aes(x = sample_mean)) +
geom_density(fill = "skyblue", alpha = 0.7) +
geom_vline(xintercept = mean(rainfall_population),
colour = "red", linetype = "dashed") +
facet_wrap(~ sample_size, ncol = 3, scales = "free",
labeller = labeller(sample_size = function(x) paste("n =", x))) +
labs(x = "Sample Mean Rainfall (mm)", y = "Density",
title = "Effect of Sample Size on Distribution of Sample Means") +
theme(strip.background = element_rect(fill = "lightgrey"),
strip.text = element_text(size = 12, face = "bold"))As sample size increases, something remarkable happens:
I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the law of frequency of error. The law would have been personified by the Greeks if they had known of it. It reigns with serenity and complete self-effacement amidst the wildest confusion. The larger the mob, the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of unreason.”
– Sir Francis Galton (source) (1822-1911)
The Central Limit Theorem (CLT) states that:
When we take sufficiently large random samples from any population:
This is true regardless of the shape of the original population distribution.
Let’s compare our original skewed rainfall data with the distribution of sample means:
# Plot original population and sampling distributions
p1 <- ggplot(data.frame(x = rainfall_population), aes(x = x)) +
geom_histogram(aes(y = ..density..), binwidth = 1,
fill = "steelblue", colour = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(rainfall_population),
sd = sd(rainfall_population)),
colour = "red", linetype = "dashed") +
labs(x = "Rainfall (mm)", y = "Density",
title = "Original Population\n(Skewed)")
p2 <- ggplot(data.frame(x = rain_means_n30), aes(x = x)) +
geom_histogram(aes(y = ..density..), binwidth = 0.5,
fill = "skyblue", colour = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(rain_means_n30),
sd = sd(rain_means_n30)),
colour = "red") +
labs(x = "Sample Mean Rainfall (mm)", y = "Density",
title = "Sampling Distribution\n(n = 30)")
plot_grid(p1, p2, ncol = 2)Even with skewed data, the sampling distribution becomes normal as sample size increases!
In biology, many processes have binary outcomes: a gene is expressed or not, a cell divides or not, an organism survives or dies. Let’s demonstrate the CLT with gene expression data:
# Create a binary population (gene expression: 0 = off, 1 = on)
# Assume the gene is expressed in 30% of cells
set.seed(123)
n_cells <- 100000
p_expressed <- 0.3
gene_expression <- rbinom(n_cells, size = 1, prob = p_expressed)
# Plot the original population distribution
ggplot(data.frame(expression = factor(gene_expression)), aes(x = expression)) +
geom_bar(fill = "purple") +
labs(x = "Gene Expression (0 = off, 1 = on)", y = "Frequency",
title = "Original Population: Binary Gene Expression",
subtitle = "Gene expressed in 30% of cells")This is a highly non-normal distribution - it’s discrete with only two possible values!
# Define sample sizes (powers of 2)
sample_sizes <- c(1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048)
# Function to generate sample means for a given sample size
generate_sample_means <- function(sample_size, n_samples = 1000) {
replicate(n_samples, mean(sample(gene_expression, size = sample_size, replace = TRUE)))
}
# Generate sample means for each sample size
set.seed(456)
all_sample_means <- lapply(sample_sizes, generate_sample_means)
names(all_sample_means) <- paste("n =", sample_sizes)
# Combine into a data frame
sample_means_df <- data.frame(
sample_mean = unlist(all_sample_means),
sample_size = factor(rep(names(all_sample_means), each = 1000),
levels = names(all_sample_means))
)
# Create faceted plot with density plots instead of histograms
ggplot(sample_means_df, aes(x = sample_mean)) +
geom_density(fill = "purple", color = "white", alpha = 0.7) +
geom_vline(xintercept = mean(gene_expression),
color = "red", linetype = "dashed") +
facet_wrap(~ sample_size, scales = "free_y", ncol = 4) +
labs(x = "Sample Mean (Proportion of Cells with Gene Expressed)",
y = "Density",
subtitle = "Red line shows true population mean (0.3)") +
theme(strip.background = element_rect(fill = "lavender"),
strip.text = element_text(size = 10, face = "bold"))Even with the most non-normal data possible (binary data), the sampling distribution of means becomes normal with sufficient sample size.
Let’s explore another example with a discrete, highly skewed distribution:
In ecological restoration, practitioners need to estimate germination rates of native seeds. Imagine each seed has only a 10% chance of germinating (p = 0.1).
# Create a binomial population - number of germinated seeds in batches of 20
set.seed(789)
n_trials <- 20 # 20 seeds per batch
p_success <- 0.1 # 10% germination rate
n_batches <- 10000 # 10,000 batches to create our "population"
# Generate the population data
germination_population <- rbinom(n_batches, size = n_trials, prob = p_success)
# Plot the population distribution
ggplot(data.frame(germinated = germination_population), aes(x = germinated)) +
geom_bar(fill = "darkgreen", color = "white") +
scale_x_continuous(breaks = 0:10) +
labs(x = "Number of Seeds Germinated (out of 20)",
y = "Frequency",
title = "Seed Germination Population Distribution",
subtitle = "Binomial(n=20, p=0.1): Discrete and highly skewed")This distribution is: - Discrete (only whole numbers of seeds can germinate) - Highly skewed (most batches have very few germinating seeds) - Very different from our continuous rainfall example
What happens when we take samples from this discrete, skewed distribution?
# Take samples of different sizes and calculate means
set.seed(123)
seed_means_n5 <- replicate(1000, mean(sample(germination_population, size = 5)))
seed_means_n30 <- replicate(1000, mean(sample(germination_population, size = 30)))
# Calculate theoretical standard error for n=30
se_n30 <- sd(germination_population)/sqrt(30)
# Plot the original population vs. sampling distributions
p1 <- ggplot(data.frame(x = germination_population), aes(x = x)) +
geom_bar(fill = "darkgreen", color = "white") +
scale_x_continuous(breaks = seq(0, 10, by = 2)) +
labs(x = "Seeds Germinated", y = "Frequency",
title = "Original Population\n(Discrete & Skewed)")
p2 <- ggplot(data.frame(x = seed_means_n5), aes(x = x)) +
geom_histogram(aes(y = ..density..), binwidth = 0.2,
fill = "lightgreen", colour = "white") +
geom_vline(xintercept = mean(germination_population),
colour = "red", linetype = "dashed") +
labs(x = "Sample Mean (n = 5)", y = "Density",
title = "Sampling Distribution\nSmall Samples")
# For n=30, use density scale and properly scaled normal curve
p3 <- ggplot(data.frame(x = seed_means_n30), aes(x = x)) +
geom_histogram(aes(y = ..density..), binwidth = 0.1,
fill = "lightgreen", colour = "white") +
geom_vline(xintercept = mean(germination_population),
colour = "red", linetype = "dashed") +
stat_function(fun = dnorm,
args = list(mean = mean(germination_population),
sd = se_n30),
colour = "red", size = 1) +
labs(x = "Sample Mean (n = 30)", y = "Density",
title = "Sampling Distribution\nLarger Samples") +
# Set x-axis limits to focus on the relevant range
xlim(mean(germination_population) - 4*se_n30,
mean(germination_population) + 4*se_n30)
# Arrange plots
plot_grid(p1, p2, p3, ncol = 3)Even with discrete, highly skewed data, the CLT still applies:
The standard error (SE) tells us how much sample means typically vary:
SE = \frac{\sigma}{\sqrt{n}}
Where: - \sigma is the population standard deviation - n is the sample size
# Calculate theoretical standard errors
pop_sd <- sd(rainfall_population)
se_n5 <- pop_sd / sqrt(5)
se_n15 <- pop_sd / sqrt(15)
se_n30 <- pop_sd / sqrt(30)
# Compare standard errors
se_n5[1] 2.89159
[1] 1.66946
[1] 1.180487
As sample size increases, standard error decreases, making our estimates more precise.
While we have been using the standard deviation to measure variability in our data, the standard error measures variability in our sample means.
# Create comparison table between Standard Deviation and Standard Error
sd_se_comparison <- data.frame(
Aspect = c("Definition", "Formula", "Measures", "Effect of sample size", "Typical use"),
Standard_Deviation = c(
"Measures spread of individual data points around their mean",
"σ = √[Σ(x-μ)²/N] or s = √[Σ(x-x̄)²/(n-1)]",
"Variability within a dataset",
"Unaffected by sample size",
"Describing the spread of data in a single sample"
),
Standard_Error = c(
"Measures precision of a sample statistic (usually the mean)",
"SE = σ/√n or SE = s/√n",
"Variability between sample means from the same population",
"Decreases as sample size increases (∝ 1/√n)",
"Inferential statistics, confidence intervals, hypothesis testing"
)
)
kable(sd_se_comparison,
col.names = c("Aspect", "Standard Deviation (SD)", "Standard Error (SE)"),
caption = "Standard Deviation vs. Standard Error")| Aspect | Standard Deviation (SD) | Standard Error (SE) |
|---|---|---|
| Definition | Measures spread of individual data points around their mean | Measures precision of a sample statistic (usually the mean) |
| Formula | σ = √[Σ(x-μ)²/N] or s = √[Σ(x-x̄)²/(n-1)] | SE = σ/√n or SE = s/√n |
| Measures | Variability within a dataset | Variability between sample means from the same population |
| Effect of sample size | Unaffected by sample size | Decreases as sample size increases (∝ 1/√n) |
| Typical use | Describing the spread of data in a single sample | Inferential statistics, confidence intervals, hypothesis testing |
The Central Limit Theorem enables modern inferential statistics:
This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.